5. Main Analysis - All
Provide a detailed, well-organized description of your findings, including textual description, graphs, and code. Your focus should be on both the results and the process. Include, as reasonable and relevant, approaches that didn’t work, challenges, the data cleaning process, etc. . The guidelines for the Executive Summary above do NOT apply to exploratory data analysis. Your main concern is designing graphs that reveal patterns and trends. . As noted in Hmk #4, do not use circles, that is: bubbles, pie charts, or polar coordinates. . Use stacked bar charts sparingly. Try grouped bar charts and faceting as alternatives, and only choose stacked bar charts if they truly do a better job than the alternatives for observing patterns.
Zip Code, Neighborhood and Land Value Data - Adam
Sidewalk Cafe License Data - Marika
Any business that operates a portion of a restaurant on a public sidewalk must obtain a Sidewalk Cafe License from New Yor City. These licenses must be renewed every two years and fall into three categories: enclosed, unenclosed, or small unenclosed sidewalk cafes.
First, to help better organize the sidewalk cafe licenses by borough, I added a new column called BOROUGH that is set to MANHATTAN, BROOKLYN, BRONX, or QUEENS. I had to manually check that only the cities in Queens had been called out specifically in the CITY column, so it was easy to distinguish them from BRONX or BROOKLYN.
sidewalks <- sidewalks %>% mutate(BOROUGH = ifelse(CITY=="NEW YORK"|CITY=="New York","MANHATTAN",ifelse(CITY=="BROOKLYN","BROOKLYN",ifelse(CITY=="BRONX","BRONX","QUEENS"))))
To get a better understanding of the distribution of these licenses, I have provided a bar graph by borough.
ggplot(sidewalks, aes(x=fct_infreq(BOROUGH)))+geom_bar(aes(fill=BOROUGH))+ggtitle("Frequency of Sidewalk Cafe Licenses by Borough")+xlab("Borough")+ylab("Frequency")
Clearly Manhattan has the most license requests, followed by Brooklyn, then Queens and finally Bronx. Since at the moment we don’t have neighborhood information (everything in Manhattan is just classified as New York, Brooklyn has only Brooklyn, and Bronx has only the city of Bronx), we can only dive into the Queens data:
queens_cafes <- sidewalks %>% filter(BOROUGH=="QUEENS")
ggplot(queens_cafes, aes(x=fct_infreq(CITY)))+geom_bar(fill="purple")+ggtitle("Frequency of Sidewalk Cafe Licenses in Queens")+xlab("City / Neighborhood")+ylab("Frequency")+coord_flip()
In queens, a large percentage of license requests come from Astoria, followed by Long Island City and Forest Hills.
Next, in order to do date comparisons to ascertain which are the new applications vs. renewal applications, I had to convert certain date fields from strings (they were read in as string factors) into dates.
sidewalks$EXPIRATION_DATE<-as.Date(sidewalks$EXPIRATION_DATE, format="%m/%d/%Y")
sidewalks$APP_STATUS_DATE<-as.Date(sidewalks$APP_STATUS_DATE, format="%m/%d/%Y")
sidewalks$SUBMIT_DATE<-as.Date(sidewalks$SUBMIT_DATE, format="%m/%d/%Y")
The list of licenses includes active licenses, expired licenses, licenses for businesses that have closed (and are now inactive), licenses which are up for renewal as part of the two year process, or new requests for licenses. To better classify them, I created a new field called STATUS_CLASSIFICATION. Those licenses which are still active and not up for renewal are classified as “ACTIVE”. Those licenses that have been submitted for renewal (either because their expiration date is less than the latest application data, or that an active license is up for review) are classified as “RENEWAL”. Those licenses that are in the sheet but do not have a license number are classified as “NEW”, and the rest are marked as “OLD” to encompass inactive licenses that have not been acted upon.
sidewalks<-sidewalks %>% mutate(STATUS_CLASSIFICATION = ifelse(LIC_STATUS=="Active" & (APP_STATUS=="Application Approved" | APP_STATUS=="Application Review Completed"),"ACTIVE",ifelse(is.na(LICENSE_NBR),"NEW",ifelse((APP_STATUS_DATE>EXPIRATION_DATE | DPQA=="Issued Temp Op Letter") | (LIC_STATUS=="Active" & (APP_STATUS=="Pending Review" | APP_STATUS=="Submitted")),"RENEWAL","OLD"))))
Now that we have classified the status of the licenses, we are able to see how these classifications differ between the boroughs.
ggplot(sidewalks)+geom_mosaic(aes(x=product(STATUS_CLASSIFICATION,BOROUGH),fill=factor(STATUS_CLASSIFICATION)))+coord_flip()+labs(x="Borough",y="License Status", fill="License Designation")+ggtitle("Boroughs by License Status")
The mosaic plot shows how Bronx and Brooklyn may be getting more new license requests as a percentage of total licenses. Bronx is also getting the highest percentage of renewal requests out of its inactive and active licenses. We can also take a look at the license designations by borough:
ggplot(sidewalks)+geom_mosaic(aes(x=product(BOROUGH,STATUS_CLASSIFICATION),fill=factor(BOROUGH)))+coord_flip()+labs(x="License Status",y="Borough", fill="Borough")+ggtitle("License Status by Borough")
Looking at the data in this way, you can see how Brooklyn has the second-most new license requests, but how Manhattan still dominates in all license status categories.
We can map the data to have a better view of where the datapoints lie. To get an overall picture, I selected a map centered on Long Island City in Queens so that we can get a good view of both Brooklyn and Bronx in addition to Manhattan.
map <- get_map( location = c(-73.9485424, 40.7454513), source = "stamen", zoom = 11, maptype="toner")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.745451,-73.948542&zoom=11&size=640x640&scale=2&maptype=terrain&sensor=false
## Map from URL : http://tile.stamen.com/toner/11/602/768.png
## Map from URL : http://tile.stamen.com/toner/11/603/768.png
## Map from URL : http://tile.stamen.com/toner/11/604/768.png
## Map from URL : http://tile.stamen.com/toner/11/602/769.png
## Map from URL : http://tile.stamen.com/toner/11/603/769.png
## Map from URL : http://tile.stamen.com/toner/11/604/769.png
## Map from URL : http://tile.stamen.com/toner/11/602/770.png
## Map from URL : http://tile.stamen.com/toner/11/603/770.png
## Map from URL : http://tile.stamen.com/toner/11/604/770.png
## Map from URL : http://tile.stamen.com/toner/11/602/771.png
## Map from URL : http://tile.stamen.com/toner/11/603/771.png
## Map from URL : http://tile.stamen.com/toner/11/604/771.png
Plotting each of the restaurants colored by their borough. You can see how Manhattan dominates in the number of sidewalk cafes, and how the sidewalk cafes in Brooklyn and Queens are largely concentrated in the areas closer to Manhattan.
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.3)
## Warning: Removed 3 rows containing missing values (geom_point).
g <- ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks, geom="polygon", alpha=0.2)
g+scale_fill_gradient(low="yellow",high="red")
## Warning: Removed 3 rows containing non-finite values (stat_density2d).
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.4)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Sidewalk Cafe Licenses by Status")
## Warning: Removed 3 rows containing missing values (geom_point).
new_active <- sidewalks %>% filter(STATUS_CLASSIFICATION=="ACTIVE" | STATUS_CLASSIFICATION=="NEW")
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Active and New Licenses in New York City")
## Warning: Removed 2 rows containing missing values (geom_point).
bronx_map <- get_map("Bronx, NY", source = "stamen", zoom = 12, maptype="toner")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Bronx,+NY&zoom=12&size=640x640&scale=2&maptype=terrain&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Bronx,%20NY&sensor=false
## Map from URL : http://tile.stamen.com/toner/12/1206/1536.png
## Map from URL : http://tile.stamen.com/toner/12/1207/1536.png
## Map from URL : http://tile.stamen.com/toner/12/1208/1536.png
## Map from URL : http://tile.stamen.com/toner/12/1206/1537.png
## Map from URL : http://tile.stamen.com/toner/12/1207/1537.png
## Map from URL : http://tile.stamen.com/toner/12/1208/1537.png
## Map from URL : http://tile.stamen.com/toner/12/1206/1538.png
## Map from URL : http://tile.stamen.com/toner/12/1207/1538.png
## Map from URL : http://tile.stamen.com/toner/12/1208/1538.png
## Map from URL : http://tile.stamen.com/toner/12/1206/1539.png
## Map from URL : http://tile.stamen.com/toner/12/1207/1539.png
## Map from URL : http://tile.stamen.com/toner/12/1208/1539.png
ggmap(bronx_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Active and New Licenses in the Bronx")
## Warning: Removed 931 rows containing missing values (geom_point).